- This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.
- This dataset was generated from The Movie Database API.
- If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.
- Relevant data to be used in this dataset analysis includes the following variables:
- Original title
- Main Genres
- Release date
- Release year
- Budget
- Revenue
- Main actor
- Month of release
Throughout the report, I explore the following questions:
- How have film budgets trended over the years?
- How has film revenue trended over the years?
- How are the popularity and ratings of movies affected by genre?
- How does profitability vary for films released in different months?
- How has the profitability of making films changed over time?
import pandas as pd
import numpy as np
import calendar
import matplotlib.pyplot as plt
import plotly.express as px
# reading the data and getting some information about it.
df = pd.read_csv('tmdb-movies.csv')
df.head(3)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/2015 | 5562 | 6.5 | 2015 | 137999939.3 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/2015 | 6185 | 7.1 | 2015 | 137999939.3 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/2015 | 2480 | 6.3 | 2015 | 101199955.5 | 2.716190e+08 |
3 rows × 21 columns
#short description of our data:
df.describe()
| | id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 |
| mean | 66064.177434 | 0.646441 | 1.462570e+07 | 3.982332e+07 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 1.755104e+07 | 5.136436e+07 |
| std | 92130.136561 | 1.000185 | 3.091321e+07 | 1.170035e+08 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 3.430616e+07 | 1.446325e+08 |
| min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000e+00 | 0.000000e+00 |
| 25% | 10596.250000 | 0.207583 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000e+00 | 0.000000e+00 |
| 50% | 20669.000000 | 0.383856 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000e+00 | 0.000000e+00 |
| 75% | 75610.000000 | 0.713817 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 145.750000 | 6.600000 | 2011.000000 | 2.085325e+07 | 3.369710e+07 |
| max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 9767.000000 | 9.200000 | 2015.000000 | 4.250000e+08 | 2.827124e+09 |
#getting information of all Null values
df.isnull().sum()
id                         0
imdb_id                   10
popularity                 0
budget                     0
revenue                    0
original_title             0
cast                      76
homepage                7930
director                  44
tagline                 2824
keywords                1493
overview                   4
runtime                    0
genres                    23
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
dtype: int64
# removing duplicated rows and rechecking that none remain
df.drop_duplicates(inplace=True)
df.duplicated().sum()
0
- We read the dataset.
- We took a peek at the values of the dataset.
- We checked the summary statistics.
- We checked the null values so we can decide what to do with them.
- We removed the duplicated rows.
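The summary table above reports a median of 0 for both `budget` and `revenue`, which suggests (though the dataset does not say so explicitly) that zeros stand in for unknown values. The sketch below, on made-up toy rows rather than the real file, shows one way to count such zeros and to see how much they pull an average down. The `toy` frame and its numbers are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); the column names mirror the dataset.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [150_000_000, 0, 20_000_000, 0],
    "revenue": [500_000_000, 0, 0, 1_000_000],
})

# Count how many rows report a zero budget or a zero revenue.
zero_counts = (toy[["budget", "revenue"]] == 0).sum()

# Assumption: a zero means "unknown", so mask it as NaN.
masked = toy.assign(
    budget=toy["budget"].replace(0, np.nan),
    revenue=toy["revenue"].replace(0, np.nan),
)

# mean() skips NaN by default, so the second average uses known values only.
mean_with_zeros = toy["budget"].mean()
mean_without_zeros = masked["budget"].mean()

print(zero_counts)
print(mean_with_zeros, mean_without_zeros)
```

Whether to mask zeros this way is a judgment call; the analysis in this report keeps them as they are.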
# changing the release date format into datetime format
# creating a column ('release month') for the month of release.
df['release_date'] = pd.to_datetime(df['release_date'],format='%m/%d/%Y')
df['release_month'] = df['release_date'].dt.month
look_up = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
df['release_month'] = df['release_month'].apply(lambda x: look_up[x])
df.head(2)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | release_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 137999939.3 | 1.392446e+09 | Jun |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 2015-05-13 | 6185 | 7.1 | 2015 | 137999939.3 | 3.481613e+08 | May |
2 rows × 22 columns
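# creating a column for the main actor (the first listed actor) of every movie.
# creating columns for the main production company and the main genre in the same way.
# computing profit from the unadjusted revenue and budget columns.
# dropping all unnecessary columns to clean up.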
df['MainActor'] = df.cast.str.split("|",expand=True,)[0]
df['MainPrdCompany'] = df.production_companies.str.split("|",expand=True,)[0]
df['MainGenre'] = df.genres.str.split("|",expand=True,)[0]
df['profit']= df['revenue']-df['budget']
df.drop(['overview', 'imdb_id', 'homepage', 'tagline', 'keywords', 'budget', 'revenue', 'cast', 'genres'], axis = 1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   id                    10865 non-null  int64
 1   popularity            10865 non-null  float64
 2   original_title        10865 non-null  object
 3   director              10821 non-null  object
 4   runtime               10865 non-null  int64
 5   production_companies  9835 non-null   object
 6   release_date          10865 non-null  datetime64[ns]
 7   vote_count            10865 non-null  int64
 8   vote_average          10865 non-null  float64
 9   release_year          10865 non-null  int64
 10  budget_adj            10865 non-null  float64
 11  revenue_adj           10865 non-null  float64
 12  release_month         10865 non-null  object
 13  MainActor             10789 non-null  object
 14  MainPrdCompany        9835 non-null   object
 15  MainGenre             10842 non-null  object
 16  profit                10865 non-null  int64
dtypes: datetime64[ns](1), float64(4), int64(5), object(7)
memory usage: 1.5+ MB
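The "main" actor, company, and genre columns are each the first entry of a pipe-delimited string. As a minimal sketch on made-up toy values, the same extraction can be written without building the full expanded frame, and missing inputs stay missing in the result. The `toy` frame is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration), including one missing value.
toy = pd.DataFrame({
    "genres": ["Action|Adventure|Thriller", "Drama", np.nan],
})

# n=1 stops splitting after the first separator; .str[0] keeps the first piece.
# Rows where genres is NaN remain NaN in MainGenre.
toy["MainGenre"] = toy["genres"].str.split("|", n=1).str[0]

print(toy)
```

This is why `MainGenre` above has 23 nulls, the same count as the original `genres` column.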
Dropping all duplicates.
Checking whether we need to remove NaN values.
# dropping duplicates.
# no need to drop NaN values.
df.drop_duplicates(inplace=True)
df.isnull().sum()
id                         0
popularity                 0
original_title             0
director                  44
runtime                    0
production_companies    1030
release_date               0
vote_count                 0
vote_average               0
release_year               0
budget_adj                 0
revenue_adj                0
release_month              0
MainActor                 76
MainPrdCompany          1030
MainGenre                 23
profit                     0
dtype: int64
# from here on, 'budget' and 'revenue' refer to the adjusted (_adj) columns.
df.rename(columns = {'budget_adj':'budget', 'revenue_adj':'revenue'}, inplace=True)
df.head()
| | id | popularity | original_title | director | runtime | production_companies | release_date | vote_count | vote_average | release_year | budget | revenue | release_month | MainActor | MainPrdCompany | MainGenre | profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | 32.985763 | Jurassic World | Colin Trevorrow | 124 | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 137999939.3 | 1.392446e+09 | Jun | Chris Pratt | Universal Studios | Action | 1363528810 |
| 1 | 76341 | 28.419936 | Mad Max: Fury Road | George Miller | 120 | Village Roadshow Pictures|Kennedy Miller Produ... | 2015-05-13 | 6185 | 7.1 | 2015 | 137999939.3 | 3.481613e+08 | May | Tom Hardy | Village Roadshow Pictures | Action | 228436354 |
| 2 | 262500 | 13.112507 | Insurgent | Robert Schwentke | 119 | Summit Entertainment|Mandeville Films|Red Wago... | 2015-03-18 | 2480 | 6.3 | 2015 | 101199955.5 | 2.716190e+08 | Mar | Shailene Woodley | Summit Entertainment | Adventure | 185238201 |
| 3 | 140607 | 11.173104 | Star Wars: The Force Awakens | J.J. Abrams | 136 | Lucasfilm|Truenorth Productions|Bad Robot | 2015-12-15 | 5292 | 7.5 | 2015 | 183999919.0 | 1.902723e+09 | Dec | Harrison Ford | Lucasfilm | Action | 1868178225 |
| 4 | 168259 | 9.335014 | Furious 7 | James Wan | 137 | Universal Pictures|Original Film|Media Rights ... | 2015-04-01 | 2947 | 7.3 | 2015 | 174799923.1 | 1.385749e+09 | Apr | Vin Diesel | Universal Pictures | Action | 1316249360 |
df.loc[:, 'counter'] = 1
df.head()
| | id | popularity | original_title | director | runtime | production_companies | release_date | vote_count | vote_average | release_year | budget | revenue | release_month | MainActor | MainPrdCompany | MainGenre | profit | counter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | 32.985763 | Jurassic World | Colin Trevorrow | 124 | Universal Studios|Amblin Entertainment|Legenda... | 2015-06-09 | 5562 | 6.5 | 2015 | 137999939.3 | 1.392446e+09 | Jun | Chris Pratt | Universal Studios | Action | 1363528810 | 1 |
| 1 | 76341 | 28.419936 | Mad Max: Fury Road | George Miller | 120 | Village Roadshow Pictures|Kennedy Miller Produ... | 2015-05-13 | 6185 | 7.1 | 2015 | 137999939.3 | 3.481613e+08 | May | Tom Hardy | Village Roadshow Pictures | Action | 228436354 | 1 |
| 2 | 262500 | 13.112507 | Insurgent | Robert Schwentke | 119 | Summit Entertainment|Mandeville Films|Red Wago... | 2015-03-18 | 2480 | 6.3 | 2015 | 101199955.5 | 2.716190e+08 | Mar | Shailene Woodley | Summit Entertainment | Adventure | 185238201 | 1 |
| 3 | 140607 | 11.173104 | Star Wars: The Force Awakens | J.J. Abrams | 136 | Lucasfilm|Truenorth Productions|Bad Robot | 2015-12-15 | 5292 | 7.5 | 2015 | 183999919.0 | 1.902723e+09 | Dec | Harrison Ford | Lucasfilm | Action | 1868178225 | 1 |
| 4 | 168259 | 9.335014 | Furious 7 | James Wan | 137 | Universal Pictures|Original Film|Media Rights ... | 2015-04-01 | 2947 | 7.3 | 2015 | 174799923.1 | 1.385749e+09 | Apr | Vin Diesel | Universal Pictures | Action | 1316249360 | 1 |
fig = px.pie(df, values='counter', names='MainGenre', title='Percentages of different genres of movies')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
In the pie chart above we can see that the most-produced genres were Drama (22.6%), Comedy (21.3%), Action (14.6%), and Horror (8.42%).
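The shares shown in the pie chart can also be computed as plain numbers, without the helper `counter` column. The sketch below uses made-up toy genres, so its percentages are not the dataset's; `toy` and `genre_share` are illustrative names.

```python
import pandas as pd

# Toy data (made up for illustration), with one missing genre.
toy = pd.DataFrame({"MainGenre": ["Drama", "Drama", "Comedy", "Action", None]})

# value_counts drops missing values by default; normalize=True returns
# fractions of the non-missing rows, which we scale to percentages.
genre_share = toy["MainGenre"].value_counts(normalize=True).mul(100).round(1)

print(genre_share)
```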
# line chart of the average budget per release year.
df_budgetvsyear = df[['budget', 'release_year']]
df_budgetvsyear = df_budgetvsyear.groupby('release_year').mean()
fig = px.line(df_budgetvsyear, y="budget", title='Budget over the years')
fig.show()
We can see here that the average movie budget fluctuated from 1960 to 1970, then increased until it peaked in 1999; afterwards we can observe a steady decline in movie budgets.
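A peak read off a chart can be confirmed numerically. As a minimal sketch on made-up toy rows (the values are not from the dataset), the year with the highest average budget can be found with `idxmax` on the grouped means.

```python
import pandas as pd

# Toy data (made up for illustration); column names mirror the report.
toy = pd.DataFrame({
    "release_year": [1998, 1998, 1999, 1999, 2000],
    "budget": [10.0, 30.0, 40.0, 50.0, 25.0],
})

# Average budget per release year, as in the line chart.
mean_budget_by_year = toy.groupby("release_year")["budget"].mean()

# idxmax returns the index label (the year) of the largest average.
peak_year = mean_budget_by_year.idxmax()
peak_value = mean_budget_by_year.max()

print(peak_year, peak_value)
```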
#Budgets and revenues over the years line chart.
df_revenuevsyear= df[['budget','revenue','release_year']]
df_revenuevsyear= df_revenuevsyear.groupby('release_year').mean()
df_revenuevsyear.plot(figsize=(12,8))
plt.title('Revenues vs Budgets')
plt.ylabel('Average value in USD')
plt.grid(True)
Putting revenues alongside budgets gives us a solid perspective on the differences and the relationship between their fluctuations.
# bar chart of average popularity per main genre, colored by average vote rating.
df_ratingvsgenre = df[['MainGenre', 'vote_average', 'popularity']]
df_ratingvsgenre = df_ratingvsgenre.groupby('MainGenre').mean()
fig = px.bar(df_ratingvsgenre, y="popularity", color="vote_average", title="Genre vs Popularity")
fig.show()
Here we have a bar chart of the different genres in the movie industry, showing both their average popularity and the average vote rating they receive.
# bar chart of average profit per release month.
df_MonthVsProfit = df[['release_month', 'profit']]
df_MonthVsProfit = df_MonthVsProfit.groupby('release_month').mean()
fig = px.bar(df_MonthVsProfit, y="profit", title="Profitability vs release month")
fig.show()
Here we can see that the most profitable movies are released mainly in summer, as well as in November and December as the holidays approach.
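Because `release_month` holds text labels, grouping by it sorts the months alphabetically (Apr, Aug, Dec, ...), which makes seasonal patterns harder to read. One possible fix, sketched here on made-up toy rows, is to turn the column into an ordered categorical so the groups come out in calendar order. The names `month_order` and `toy` are illustrative assumptions.

```python
import pandas as pd

# Calendar order for the month abbreviations used in the report.
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "release_month": ["Jun", "Jan", "Dec", "Jun", "Apr"],
    "profit": [100, 10, 80, 60, 20],
})

# An ordered categorical sorts by position in month_order, not alphabetically.
toy["release_month"] = pd.Categorical(
    toy["release_month"], categories=month_order, ordered=True
)

# observed=True keeps only the months that actually occur in the data.
mean_profit_by_month = toy.groupby("release_month", observed=True)["profit"].mean()

print(mean_profit_by_month)
```

Passing the resulting frame to a bar chart would then plot the months from January to December.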
1. The statistics show that the highest movie profits are made in summer, as well as before the holidays in November and December.
2. Drama, Comedy, and Action are the most-produced genres, which may be an indication of their popularity, but this needs further investigation.
3. It is evident from the budget chart above that big-budget movies flourished from 1990 to 2000, after which the average budget trended down from 33 million in 1999 to 10.4 million in 2014.
4. When observing the relation between film budgets and revenues, we find that the two move in the same direction: when average budgets rise, average revenues tend to rise as well.
5. We can see that the two most popular genres are Action and Science Fiction, and although Documentaries are not as popular, they received the highest average rating at 6.9/10, followed by Music at 6.6/10.
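The budget and revenue relationship in point 4 is judged from a chart. A correlation coefficient is one simple way to put a number on how closely two columns move together. The sketch below uses made-up toy values, so the result says nothing about the real dataset.

```python
import pandas as pd

# Toy data (made up for illustration); column names mirror the report.
toy = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0, 40.0],
    "revenue": [30.0, 40.0, 90.0, 80.0],
})

# Pearson correlation: +1 is a perfect increasing linear relation,
# 0 is no linear relation, -1 is a perfect decreasing one.
r = toy["budget"].corr(toy["revenue"])

print(round(r, 3))
```

A value close to 1 would support a strong positive linear association, although correlation alone does not show strict proportionality.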
1. We used the TMDb movies dataset for our analysis and worked with popularity, ratings, budget, revenue, and release month. Our analysis is limited to the provided dataset. For example, the dataset does not confirm that every release of every director is listed.
2. No normalization, exchange-rate, or currency conversion is considered in this analysis, which is limited to the numerical values of revenue as provided.
3. Dropping missing or null values from the variables of interest might skew our analysis and could introduce unintentional bias into the relationships being analyzed.